ABSTRACT

The  recently  advanced  philosophy  of  model  building  is  developed  further. 

It  is  stressed  how  Bayesian  inferences  based  on  the  posterior  distribution  of 
the  model  parameters  are  appropriate  only  after  sampling  theory  inferences 
based  on  the  predictive  distribution  of  the  data  fail  to  discredit  the  model. 

An example involving the normal distribution is discussed in detail. Diagnostic
checking functions are developed which can be applied in an intuitive sequential
manner. Careful attention is also given to the nature of the predictive
distribution for the extreme situations where information about the parameters
is very precise or very vague. For the latter case, it is illustrated how the
predictive distribution can simultaneously (i) reflect this vague information in
an appropriate manner and (ii) allow for the checking of the adequacy of the
basic distributional assumptions such as normality and independence.

A particular problem in the interpretation of predictive distributions
arises in situations involving a discrete data-generating distribution with
vague prior knowledge about the parameter(s). This problem is explored in depth
for the case of the binomial distribution.

SIGNIFICANCE AND EXPLANATION


The objective of many scientific studies is to develop a model which will
provide a reasonably simple yet sufficiently adequate representation of the
phenomenon under consideration. At various stages of a scientific
investigation, a confrontation occurs between the model being tentatively
entertained at that stage and the data that have been collected up to that
stage. Model estimation and model criticism are the two devices which are used
by the investigator in performing the dual roles of model sponsor and model
critic that are necessary for the advancement of knowledge. This paper explores
in further detail a viewpoint of model building, whereby model criticism
requires the sampling theory mode of statistical inference, while model
estimation employs the Bayesian mode of statistical inference. In particular,
the implications of having only vague prior information about the model
parameters are explored.


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  authors  of  this  report. 


SOME  ASPECTS  OF  MODEL  ESTIMATION  AND  MODEL  CRITICISM 
Steven  P.  Bailey  and  George  E.  P.  Box 


1. Introduction.

The objective of many scientific studies is to develop a model which will
provide a reasonably simple yet sufficiently adequate representation of the
phenomenon under consideration. The most useful models of this nature will
typically be those which elucidate not only the deterministic relationships
among the variables of interest but also the stochastic relationships among the
experimental errors associated with these variables. (Here the meaning of the
term "deterministic relationship" is not restricted to mechanistic
relationships derivable from existing theory; rather, suitably developed
empirical relationships, such as polynomials, are also considered as being
deterministic.) In this paper a theory of model building recently advanced by
Box (1979b) will be outlined and studied in further detail. (See also, for
example, the following: Box, 1979a; Box, Hunter and Hunter, 1978; Box and
Jenkins, 1976; Box and Tiao, 1973; Box and Youle, 1955.)

At various stages of a scientific investigation, a confrontation occurs
between the model being tentatively entertained at that stage and the data that
have been collected up to that stage. Model estimation and model criticism are
two inferential devices which aid the investigator in performing the dual role
of model sponsor and model critic.

Model criticism techniques focus on the question of whether or not there is
approximate concordance between the data currently available and the model in
its current form. If some particular aspects of the data seem to be discordant
with respect to the model, then either the model will need to be appropriately
modified in an attempt to alleviate the model deficiencies, or further data
will need to be collected in order to explore the inadequate aspects of the
model. The broad spectrum of diagnostic techniques available for model
criticism ranges from the informality of examining residual plots to the
formality of carrying out goodness of fit tests, but it is argued that all such
techniques are justified by the sampling theory mode of inference.

Model estimation is meaningful only when the application of the above model
criticism techniques fails to reveal any model inadequacies. If such is the
case, then it is appropriate to employ Bayes' Theorem in estimating the unknown
model parameters by obtaining their joint posterior distribution. The use of
Bayesian inference in model estimation is a logical consequence of the view
that a model is, in effect, a joint probability statement of all assumptions,
both explicit and implicit, which are to be tentatively entertained at the
current stage of the investigation.

Now, the model building process is iterative by nature, and, as such, there
is no single approach that will be appropriate at every stage of this
iteration. Thus it is not unreasonable to argue as above that two different
kinds of statistical inference are required in order for an investigator to be
able to both sponsor and criticize a model.

2. Model specification and subsequent inferences.

Denote by M the model under consideration at a given stage of an
investigation. This model can be conveniently expressed as the joint
distribution of the potentially observable data vector $\mathbf{y}$ and the
unknown parameter vector $\boldsymbol{\theta}$ and can thus be written

$$p(\mathbf{y}, \boldsymbol{\theta} \mid M) = p(\mathbf{y} \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid M). \tag{1}$$

When viewed as a function of $\mathbf{y}$ for $\boldsymbol{\theta}$ given,
$p(\mathbf{y} \mid \boldsymbol{\theta}, M)$ is referred to as the
data-generating distribution or, more simply, the data distribution; when
viewed as a function of $\boldsymbol{\theta}$ for $\mathbf{y}$ fixed,
$p(\mathbf{y} \mid \boldsymbol{\theta}, M)$ is the likelihood function. This
factor combines with the prior distribution $p(\boldsymbol{\theta} \mid M)$,
which as argued above is an essential part of the model, to yield the complete
model statement given by (1).




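This expression can be subsequently factored as

$$p(\mathbf{y}, \boldsymbol{\theta} \mid M) = p(\boldsymbol{\theta} \mid \mathbf{y}, M)\, p(\mathbf{y} \mid M), \tag{2}$$

where

$$p(\mathbf{y} \mid M) = \int_{\Theta} p(\mathbf{y} \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid M)\, d\boldsymbol{\theta} \tag{3}$$

and

$$p(\boldsymbol{\theta} \mid \mathbf{y}, M) = \frac{p(\mathbf{y} \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid M)}{p(\mathbf{y} \mid M)} \tag{4}$$

are, respectively, the predictive distribution of $\mathbf{y}$ and the
posterior distribution of $\boldsymbol{\theta}$, with $\Theta$ denoting the
parameter space of $\boldsymbol{\theta}$. Upon obtaining the actual data, which
will be denoted by $\mathbf{y}_d$, the posterior distribution becomes

$$p(\boldsymbol{\theta} \mid \mathbf{y}_d, M) = \frac{p(\mathbf{y}_d \mid \boldsymbol{\theta}, M)\, p(\boldsymbol{\theta} \mid M)}{p(\mathbf{y}_d \mid M)}, \tag{5}$$

where the normalizing factor in the denominator is seen to be the predictive
density (3) evaluated at $\mathbf{y}_d$.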

The predictive distribution (3) acts as a reference distribution for the
observed data vector $\mathbf{y}_d$, and thus an overall portmanteau check of
the "goodness" of the model M is given by the probability

$$\Pr[\, p(\mathbf{y} \mid M) < p(\mathbf{y}_d \mid M) \mid M \,]. \tag{6}$$

If this probability is unusually small, then the posterior distribution
$p(\boldsymbol{\theta} \mid \mathbf{y}_d, M)$ provides the means of making
inferences about the parameters $\boldsymbol{\theta}$ of a model M that is of
questionable relevance. Stated in this way, the role of the predictive
distribution in model criticism is justifiably emphasized.

Of course, due to its portmanteau nature, the check (6) does not comment on
the specific ways in which the n-dimensional data vector $\mathbf{y}_d$ may be
in discord with the model M. Of more use to the investigator will be individual
checks which are sensitive to particular aspects of possible model inadequacy.
Thus, if a particular function of the data, say $g(\mathbf{y})$, is appropriate
for assessing the validity of a particular model assumption, then the referral
of the observed value $g(\mathbf{y}_d)$ to its reference distribution
$p(g(\mathbf{y}) \mid M)$, obtainable by integration of the predictive
distribution, provides a formal check relative to the assumption in question.
Moreover, informal model criticism techniques such as the examination of
residuals can be viewed as a search for any unusual feature of $\mathbf{y}_d$
which suggests model inadequacy; so that, if desired, a formal assessment of
the "unusualness" of such a feature can be effected by referring an appropriate
summary measure of the feature to its reference distribution derivable from
$p(\mathbf{y} \mid M)$.

To illustrate the above concepts, an example is now considered.


3. The normal distribution.

Suppose that $\mathbf{y} = (y_1, \ldots, y_n)'$ is a vector of n independent
observations that are normally distributed with common mean $\theta$ and common
variance $\sigma^2$, where $\theta$ and $\sigma^2$ are unknown but have a joint
prior distribution such that, conditional on a given $\sigma^2$, $\theta$ is
normally distributed with mean $\theta_0$ and variance $\sigma^2/n_0$ and that
marginally $\sigma^2/(\nu_0\sigma_0^2)$ has an inverted $\chi^2$ distribution
with $\nu_0$ degrees of freedom. We will not necessarily assume that $n_0$ is
an integer or that $\nu_0 = n_0 - 1$; however, if this were the case, then the
above joint prior distribution would be equivalent to assuming that a relevant
set of past data $\mathbf{y}_0 = (y_{01}, \ldots, y_{0n_0})'$ had been combined
with a non-informative prior for $\theta$ and $\sigma^2$ so that

$$\theta_0 = \bar{y}_0 = \frac{1}{n_0}\sum_{u=1}^{n_0} y_{0u} \quad\text{and}\quad \sigma_0^2 = s_0^2 = \frac{1}{\nu_0}\sum_{u=1}^{n_0} (y_{0u} - \bar{y}_0)^2 .$$

(See, for example, Box and Tiao, 1973.)

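Thus

$$p(\mathbf{y} \mid \theta, \sigma^2, M) = (2\pi\sigma^2)^{-n/2} \exp\Bigl\{-\frac{1}{2\sigma^2}\bigl[\nu s^2 + n(\bar{y} - \theta)^2\bigr]\Bigr\} \tag{7}$$

with $\bar{y} = \frac{1}{n}\sum_{u=1}^{n} y_u$, $\nu = n - 1$, and
$s^2 = \frac{1}{\nu}\sum_{u=1}^{n} (y_u - \bar{y})^2$; and

$$p(\theta, \sigma^2 \mid M) = p(\theta \mid \sigma^2, M)\, p(\sigma^2 \mid M), \tag{8}$$

$$p(\theta \mid \sigma^2, M) = \Bigl(\frac{2\pi\sigma^2}{n_0}\Bigr)^{-1/2} \exp\Bigl\{-\frac{n_0}{2\sigma^2}(\theta - \theta_0)^2\Bigr\}, \tag{9}$$

$$p(\sigma^2 \mid M) = \Bigl\{\Gamma\Bigl(\frac{\nu_0}{2}\Bigr) 2^{\nu_0/2}\Bigr\}^{-1} (\sigma^2)^{-(\nu_0/2 + 1)} (\nu_0\sigma_0^2)^{\nu_0/2} \exp\Bigl\{-\frac{\nu_0\sigma_0^2}{2\sigma^2}\Bigr\}; \tag{10}$$

these combine to give the complete model statement

$$p(\mathbf{y}, \theta, \sigma^2 \mid M) = p(\mathbf{y} \mid \theta, \sigma^2, M)\, p(\theta, \sigma^2 \mid M) = p(\mathbf{y} \mid \theta, \sigma^2, M)\, p(\theta \mid \sigma^2, M)\, p(\sigma^2 \mid M). \tag{11}$$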


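Now it is well known that the joint posterior distribution of $\theta$ and
$\sigma^2$ is such that, conditional on a given $\sigma^2$, $\theta$ is
normally distributed with mean
$\bar{\theta} = (n_0\theta_0 + n\bar{y})/(n_0 + n)$ and variance
$\sigma^2/(n_0 + n)$ and that, marginally,
$\sigma^2/[(\nu_0 + n)\bar{\sigma}^2]$ has an inverted $\chi^2$ distribution
with $\nu_0 + n$ degrees of freedom, where

$$(\nu_0 + n)\bar{\sigma}^2 = \nu_0\sigma_0^2 + \nu s^2 + \frac{n_0 n(\bar{y} - \theta_0)^2}{n_0 + n}.$$

Thus

$$p(\theta, \sigma^2 \mid \mathbf{y}, M) = p(\theta \mid \mathbf{y}, \sigma^2, M)\, p(\sigma^2 \mid \mathbf{y}, M), \tag{12}$$

$$p(\theta \mid \mathbf{y}, \sigma^2, M) = \Bigl(\frac{2\pi\sigma^2}{n_0 + n}\Bigr)^{-1/2} \exp\Bigl\{-\frac{n_0 + n}{2\sigma^2}(\theta - \bar{\theta})^2\Bigr\}, \tag{13}$$

and

$$p(\sigma^2 \mid \mathbf{y}, M) = \Bigl\{\Gamma\Bigl(\frac{\nu_0 + n}{2}\Bigr) 2^{(\nu_0 + n)/2}\Bigr\}^{-1} (\sigma^2)^{-\left(\frac{\nu_0 + n}{2} + 1\right)} \bigl\{(\nu_0 + n)\bar{\sigma}^2\bigr\}^{(\nu_0 + n)/2} \exp\Bigl\{-\frac{(\nu_0 + n)\bar{\sigma}^2}{2\sigma^2}\Bigr\}. \tag{14}$$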

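If (11) is factored as

$$p(\mathbf{y}, \theta, \sigma^2 \mid M) = p(\theta, \sigma^2 \mid \mathbf{y}, M)\, p(\mathbf{y} \mid M) = p(\theta \mid \mathbf{y}, \sigma^2, M)\, p(\sigma^2 \mid \mathbf{y}, M)\, p(\mathbf{y} \mid M), \tag{15}$$

it follows that

$$p(\mathbf{y} \mid M) = c_1 \bigl[(\nu_0 + n)\bar{\sigma}^2\bigr]^{-(\nu_0 + n)/2}, \qquad c_1 = \pi^{-n/2} \Bigl(\frac{n_0}{n_0 + n}\Bigr)^{1/2} \frac{\Gamma[(\nu_0 + n)/2]}{\Gamma(\nu_0/2)} (\nu_0\sigma_0^2)^{\nu_0/2}. \tag{16}$$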
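The predictive density of the normal model can be evaluated numerically. The following is a minimal sketch, assuming the form $p(\mathbf{y} \mid M) = c_1[(\nu_0+n)\bar{\sigma}^2]^{-(\nu_0+n)/2}$ with $c_1 = \pi^{-n/2}\{n_0/(n_0+n)\}^{1/2}\,\Gamma[(\nu_0+n)/2]/\Gamma(\nu_0/2)\,(\nu_0\sigma_0^2)^{\nu_0/2}$, which is what results from dividing the joint model statement by the posterior factors. The function name and argument names are illustrative and are not taken from the report.

```python
import math


def log_predictive_density(y, theta0, n0, sigma0_sq, nu0):
    """Log of the predictive density p(y | M) for the conjugate normal model.

    Assumes the closed form c1 * [(nu0 + n) * sigma_bar^2] ** (-(nu0 + n) / 2).
    """
    n = len(y)
    ybar = sum(y) / n
    ss = sum((v - ybar) ** 2 for v in y)  # nu * s^2, with nu = n - 1
    # (nu0 + n) * sigma_bar^2: prior, within-sample and mean-discrepancy parts
    total = nu0 * sigma0_sq + ss + n0 * n * (ybar - theta0) ** 2 / (n0 + n)
    log_c1 = (
        -0.5 * n * math.log(math.pi)
        + 0.5 * math.log(n0 / (n0 + n))
        + math.lgamma(0.5 * (nu0 + n))
        - math.lgamma(0.5 * nu0)
        + 0.5 * nu0 * math.log(nu0 * sigma0_sq)
    )
    return log_c1 - 0.5 * (nu0 + n) * math.log(total)
```

Working on the log scale avoids overflow in the gamma functions for large n. For n = 1 the expression reduces to a t density with $\nu_0$ degrees of freedom centered at $\theta_0$, which gives a simple way to check the constant.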


Model criticism will involve using the above predictive distribution to
assess whether or not relevant aspects of the observed data vector
$\mathbf{y}_d$ are in concordance with M. For this example, the most useful
diagnostic sample quantities can be expected to directly involve $\bar{y}$,
$s^2$, and the standardized residuals
$\mathbf{r} = (\mathbf{y} - \bar{y}\mathbf{1})/s$; thus it will be convenient
to make the following transformation. First, make an orthogonal transformation
from $\mathbf{y}$ to $\mathbf{v} = (v_1, \ldots, v_n)'$ with
$v_n = \sqrt{n}\,\bar{y}$, and then transform from $\mathbf{v}$ to $\bar{y}$,
$s^2$, and $\mathbf{u} = (u_1, \ldots, u_{n-2})'$, where

$$u_j = \frac{v_{j+1}}{\bigl(\frac{1}{j}\sum_{i=1}^{j} v_i^2\bigr)^{1/2}}, \qquad j = 1, \ldots, n-2. \tag{17}$$


The first transformation has unit Jacobian, while the second transformation
has Jacobian

$$|J| = c_2 (s^2)^{\frac{n-1}{2} - 1} \prod_{j=1}^{n-2} \Bigl(1 + \frac{u_j^2}{j}\Bigr)^{-\frac{j+1}{2}}, \qquad c_2 = n^{1/2} (n-1)^{\frac{n-1}{2}} [\Gamma(n-1)]^{-1/2}. \tag{18}$$


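Hence,

$$p(\bar{y}, s^2, \mathbf{u} \mid M) = p(\mathbf{y} \mid M)\,|J| = c_1 c_2 (s^2)^{\frac{n-1}{2} - 1} \bigl[(\nu_0 + n)\bar{\sigma}^2\bigr]^{-\frac{\nu_0 + n}{2}} \times \prod_{j=1}^{n-2} \Bigl(1 + \frac{u_j^2}{j}\Bigr)^{-\frac{j+1}{2}}. \tag{19}$$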


In the above, the factor involving a given $u_j$ is proportional to the
standardized t density with j degrees of freedom, so that

$$p(u_j \mid M) = a_j \Bigl(1 + \frac{u_j^2}{j}\Bigr)^{-\frac{j+1}{2}}, \qquad a_j = \frac{\Gamma[(j+1)/2]}{(j\pi)^{1/2}\,\Gamma(j/2)}. \tag{20}$$


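The $u_j$'s are mutually independent and independent of $\bar{y}$ and $s^2$,
so that

$$p(\mathbf{u} \mid M) = \prod_{j=1}^{n-2} p(u_j \mid M) = c_3 \prod_{j=1}^{n-2} \Bigl(1 + \frac{u_j^2}{j}\Bigr)^{-\frac{j+1}{2}}, \qquad c_3 = \prod_{j=1}^{n-2} a_j, \tag{21}$$

and

$$p(\bar{y}, s^2 \mid M) = \frac{c_1 c_2}{c_3} (s^2)^{\frac{n-1}{2} - 1} \bigl[(\nu_0 + n)\bar{\sigma}^2\bigr]^{-\frac{\nu_0 + n}{2}}. \tag{22}$$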


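Writing $\nu_p s_p^2 = \nu_0\sigma_0^2 + \nu s^2$, where
$\nu_p = \nu_0 + \nu = \nu_0 + n - 1$, so that

$$(\nu_0 + n)\bar{\sigma}^2 = \nu_p s_p^2 + \frac{n_0 n(\bar{y} - \theta_0)^2}{n_0 + n},$$

(22) then becomes

$$p(\bar{y}, s^2 \mid M) = \frac{c_1 c_2}{c_3} (s^2)^{\frac{n-1}{2} - 1} (\nu_p s_p^2)^{-\frac{\nu_0 + n}{2}} \Bigl(1 + \frac{t^2}{\nu_p}\Bigr)^{-\frac{\nu_0 + n}{2}},$$

where

$$t = \frac{\bar{y} - \theta_0}{\bigl(\frac{1}{n_0} + \frac{1}{n}\bigr)^{1/2} s_p}.$$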


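The above establishes that transformation from $\bar{y}$ to t will achieve
independence. The Jacobian of this transformation is

$$\Bigl|\frac{\partial \bar{y}}{\partial t}\Bigr| = \Bigl(\frac{1}{n_0} + \frac{1}{n}\Bigr)^{1/2} s_p .$$

Hence,

$$p(t, s^2 \mid M) = p(t \mid M)\, p(s^2 \mid M),$$

where

$$p(t \mid M) = \frac{\Gamma[(\nu_p + 1)/2]}{(\nu_p\pi)^{1/2}\,\Gamma(\nu_p/2)} \Bigl(1 + \frac{t^2}{\nu_p}\Bigr)^{-\frac{\nu_p + 1}{2}}$$

is the t density with $\nu_p$ degrees of freedom.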


In summary, then, by transforming from $\mathbf{y}$ to
$t = (\bar{y} - \theta_0)/[(1/n_0 + 1/n)^{1/2} s_p]$, $F = s^2/\sigma_0^2$, and
the quantities $\mathbf{u} = (u_1, \ldots, u_{n-2})'$ defined by (17), the
predictive distribution of $\mathbf{y}$ in (16) can be alternately expressed in
the form

$$p(t, F, \mathbf{u} \mid M) = p(t \mid M)\, p(F \mid M) \prod_{j=1}^{n-2} p(u_j \mid M), \tag{31}$$

which is more convenient for the purposes of model criticism.

Nature  of  the  diagnostic  checks. 

The above quantities can be utilized in the following sequential manner:

(1) The vector $\mathbf{r}$ of standardized residuals has a sampling
distribution which is uniform over the surface of an n-1 dimensional
hypersphere of radius $\sqrt{n-1}$ which lies in the subspace that is
orthogonal to the vector $\mathbf{1}$ in n space (Andrews, 1971). Since this
sampling distribution is independent of $\theta$ and $\sigma^2$, it will also
be the predictive distribution of the standardized residuals. Now, the purpose
of examining residuals, either informally through residual plotting or formally
through the explicit consideration of any suitably chosen function
$g(\mathbf{r})$, is to assess whether or not $\mathbf{r}$ lies in a
"suspicious" direction (Box, 1960) due to some inadequacy in the assumed
data-generating distribution (for example, caused by a lack of independence or
of normality). Since any function of $\mathbf{r}$ can be equivalently expressed
as a function of $\mathbf{u}$, the suspiciousness of the observed value
$g(\mathbf{r}_d)$ $(= h(\mathbf{u}_d))$ of any function $g(\mathbf{r})$
$(= h(\mathbf{u}))$ of interest can be assessed by using the reference
distribution $p(h(\mathbf{u}) \mid M)$ obtainable from $p(\mathbf{u} \mid M)$.

For the purpose of illustration, suppose it was suspected that the responses
$\mathbf{y}$ might be approximately linearly affected by a variable $\xi$ (say,
lab temperature) which takes values
$\boldsymbol{\xi}_d = (\xi_{1d}, \ldots, \xi_{nd})'$ for the n observations.
If, in terms of the orthogonal transformation from $\mathbf{y}$ to $\mathbf{v}$
used earlier, we define $v_{n-1} = \mathbf{c}'\mathbf{y}$ with
$\mathbf{c} = (\boldsymbol{\xi}_d - \bar{\xi}_d\mathbf{1})/S_{\xi d}$, where
$\bar{\xi}_d = \frac{1}{n}\sum_{j=1}^{n} \xi_{jd}$ and
$S_{\xi d}^2 = \sum_{j=1}^{n} (\xi_{jd} - \bar{\xi}_d)^2$, then a checking
function appropriate for this situation would be $u_{n-2}$, as defined by (17).
Referral of the observed quantity $u_{(n-2)d}$ to the t distribution with n-2
degrees of freedom provides a check on this possible departure from the
adequacy of $p(\mathbf{y} \mid \theta, \sigma^2, M)$.

Note how this check links up with the techniques of residual plotting (of the
$r_{jd}$'s vs. the $\xi_{jd}$'s) and analysis of variance (where it turns out
that $u_{(n-2)d}^2$ is the mean square ratio
$MS_{\text{linear}}/MS_{\text{residual}}$).
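The linear-trend check can be sketched in a few lines. The sketch assumes that $u_{n-2}$ is the contrast $v_{n-1} = \mathbf{c}'\mathbf{y}$ divided by the root mean square of the remaining n-2 residual contrasts, which is the reading consistent with the t reference distribution on n-2 degrees of freedom and with the mean square ratio mentioned above. The function name is illustrative.

```python
import math


def linear_check_statistic(y, xi):
    """Return u_{n-2} for a suspected linear effect of xi on y.

    Assumes u_{n-2} = v_{n-1} / sqrt(residual sum of squares / (n - 2)),
    where v_{n-1} = c'y and c is the centered, unit-length version of xi.
    """
    n = len(y)
    ybar = sum(y) / n
    xibar = sum(xi) / n
    s_xi = math.sqrt(sum((x - xibar) ** 2 for x in xi))
    c = [(x - xibar) / s_xi for x in xi]
    v = sum(ci * yi for ci, yi in zip(c, y))        # v_{n-1} = c'y
    ss_total = sum((yi - ybar) ** 2 for yi in y)    # v_1^2 + ... + v_{n-1}^2
    ss_resid = ss_total - v ** 2                    # v_1^2 + ... + v_{n-2}^2
    return v / math.sqrt(ss_resid / (n - 2))
```

Because the orthogonal transformation preserves the total sum of squares about the mean, the full matrix never has to be built: only the contrast $\mathbf{c}'\mathbf{y}$ and the corrected total sum of squares are needed. The returned value would then be referred to a t table with n-2 degrees of freedom.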

(2) If the above checks based on the $u_j$'s and the quantities derivable
from them do not invalidate the part of the model given by the data-generating
distribution, then attention can be shifted to the appropriateness of the prior
distribution. In particular, referral of the observed quantity
$F_d = s_d^2/\sigma_0^2$ to the F distribution with $\nu$ and $\nu_0$ degrees
of freedom provides a check on the concordance of the sample estimate of
$\sigma^2$, $s_d^2$, with the prior mean for $\sigma^2$, $\sigma_0^2$.

(3) A check on the concordance of $\bar{y}_d$, the sample estimate of
$\theta$, with $\theta_0$, the prior mean of $\theta$, is provided by referring
the quantity
$t_d = (\bar{y}_d - \theta_0)/[(1/n_0 + 1/n)^{1/2} s_{pd}]$ to the t
distribution with $\nu_p = \nu_0 + \nu$ degrees of freedom. Note that the
denominator of $t_d$ utilizes the estimate $s_{pd}^2$ of $\sigma^2$ which
results from pooling $s_d^2$ and $\sigma_0^2$.

(4) If all of the above checks do not indicate model inadequacy, then the
investigator can proceed to make inferences about $\theta$ and $\sigma^2$ from
$p(\theta, \sigma^2 \mid \mathbf{y}_d, M)$. In particular,

(i) $\bar{\theta}_d$, the posterior mean of $\theta$, is obtained as a
weighted average of $\bar{y}_d$ and $\theta_0$, and

(ii) $\bar{\sigma}_d^2$, the posterior mean of $\sigma^2$, is obtained by
pooling $s_{pd}^2$ with the single degree of freedom estimate
$n_0 n(\bar{y}_d - \theta_0)^2/(n_0 + n)$ of $\sigma^2$.


A  numerical  example. 

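DeGroot (1970, p. 171) gives an example involving the present model, where
the joint prior distribution of $\theta$ and $\sigma^2$ is chosen to satisfy
$E(\theta \mid M) = 2$, $\mathrm{Var}(\theta \mid M) = 5$,
$E(\sigma^{-2} \mid M) = 3$, and $\mathrm{Var}(\sigma^{-2} \mid M) = 3$. In our
notation, this corresponds to

$$\begin{aligned}
\theta_0 &= E(\theta \mid M) = 2, \\
\sigma_0^2 &= [E(\sigma^{-2} \mid M)]^{-1} = \tfrac{1}{3}, \\
\nu_0 &= 2[E(\sigma^{-2} \mid M)]^2 [\mathrm{Var}(\sigma^{-2} \mid M)]^{-1} = 2(3^2)(3)^{-1} = 6, \\
n_0 &= \frac{\nu_0\sigma_0^2}{\nu_0 - 2} [\mathrm{Var}(\theta \mid M)]^{-1} = \bigl(\tfrac{1}{2}\bigr)\bigl(\tfrac{1}{5}\bigr) = \tfrac{1}{10}.
\end{aligned} \tag{32}$$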
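The moment matching behind (32) can be written out as a short routine. This is a sketch under the stated prior, using $E(\sigma^{-2} \mid M) = 1/\sigma_0^2$, $\mathrm{Var}(\sigma^{-2} \mid M) = 2/(\nu_0\sigma_0^4)$ and $\mathrm{Var}(\theta \mid M) = E(\sigma^2 \mid M)/n_0$ with $E(\sigma^2 \mid M) = \nu_0\sigma_0^2/(\nu_0 - 2)$. The function name is illustrative.

```python
def prior_hyperparameters(mean_theta, var_theta, mean_precision, var_precision):
    """Convert prior moments into (theta0, sigma0_sq, nu0, n0).

    mean_precision and var_precision are the prior mean and variance of
    1 / sigma^2. Requires nu0 > 2 so that E(sigma^2 | M) exists.
    """
    sigma0_sq = 1.0 / mean_precision
    nu0 = 2.0 * mean_precision ** 2 / var_precision
    if nu0 <= 2.0:
        raise ValueError("nu0 must exceed 2 for Var(theta | M) to be finite")
    # Var(theta | M) = E(sigma^2 | M) / n0 = nu0 * sigma0_sq / ((nu0 - 2) * n0)
    n0 = nu0 * sigma0_sq / ((nu0 - 2.0) * var_theta)
    return mean_theta, sigma0_sq, nu0, n0
```

Calling `prior_hyperparameters(2, 5, 3, 3)` reproduces the values quoted from DeGroot's example.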


DeGroot then uses the summary statistics $\bar{y}_d = 4.20$ and
$\sum_{u=1}^{n} (y_{ud} - \bar{y}_d)^2 = \nu s_d^2 = 5.40$ from a vector
$\mathbf{y}_d$ of n = 10 observations to obtain the joint posterior
distribution of $\theta$ and $\sigma^2$, which, in our notation, is such that
conditional on a given $\sigma^2$, $\theta$ is normally distributed with mean

$$\bar{\theta}_d = \frac{n_0\theta_0 + n\bar{y}_d}{n_0 + n} = \frac{(.1)(2) + (10)(4.20)}{.1 + 10} = \frac{42.2}{10.1}$$

and variance $\sigma^2/(n_0 + n) = \sigma^2/10.1$, and that marginally
$\sigma^2/[(\nu_0 + n)\bar{\sigma}_d^2]$ has an inverted $\chi^2$ distribution
with $\nu_0 + n = 16$ degrees of freedom, where

$$(\nu_0 + n)\bar{\sigma}_d^2 = \nu_0\sigma_0^2 + \nu s_d^2 + \frac{n_0 n(\bar{y}_d - \theta_0)^2}{n_0 + n} = 6\bigl(\tfrac{1}{3}\bigr) + 5.40 + \frac{(.1)(10)(4.20 - 2)^2}{.1 + 10} = 7.88 .$$
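The posterior quantities in this example follow from the summary statistics alone. A minimal sketch, with illustrative function and argument names (`ss` denotes the corrected sum of squares $\nu s_d^2$):

```python
def posterior_summary(ybar, ss, n, theta0, n0, sigma0_sq, nu0):
    """Posterior hyperparameters for the conjugate normal model.

    Returns (posterior mean of theta, n0 + n, nu0 + n, scale), where scale is
    the sum nu0*sigma0_sq + ss + n0*n*(ybar - theta0)^2 / (n0 + n).
    """
    theta_bar = (n0 * theta0 + n * ybar) / (n0 + n)
    scale = nu0 * sigma0_sq + ss + n0 * n * (ybar - theta0) ** 2 / (n0 + n)
    return theta_bar, n0 + n, nu0 + n, scale
```

With the values above, `posterior_summary(4.20, 5.40, 10, 2.0, 0.1, 1/3, 6)` gives a posterior mean of 42.2/10.1 for $\theta$ and a scale of about 7.88 on 16 degrees of freedom.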


However, before making inferences based on
$p(\theta, \sigma^2 \mid \mathbf{y}_d, M)$, the diagnostic checks discussed
above should be employed to see whether or not $\mathbf{y}_d$ is concordant
with M. In particular, by referring

$$F_d = \frac{s_d^2}{\sigma_0^2} = \frac{.60}{1/3} = 1.80$$

to the F distribution with $\nu = 9$ and $\nu_0 = 6$ degrees of freedom and
referring

$$t_d = \frac{\bar{y}_d - \theta_0}{\bigl(\frac{1}{n_0} + \frac{1}{n}\bigr)^{1/2} s_{pd}} = \frac{4.20 - 2}{\Bigl[(10.1)\,\frac{6(\frac{1}{3}) + 9(.60)}{6 + 9}\Bigr]^{1/2}} = \frac{2.20}{\bigl[\frac{(10.1)(7.40)}{15}\bigr]^{1/2}} = .986$$

to the t distribution with $\nu_0 + \nu = 15$ degrees of freedom, no evidence
is seen of model inadequacy with respect to the prior distribution. Of course,
a thorough criticism of the model M would require that checks based on the
standardized residuals $\mathbf{r}_d$ be carried out to assess the
appropriateness of $p(\mathbf{y} \mid \theta, \sigma^2, M)$; however, only the
minimal sufficient statistics $\bar{y}_d$ and $s_d^2$ are given by DeGroot.
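The two prior checks can be computed directly from the summary statistics. The sketch below returns the statistics and their degrees of freedom only; tail probabilities would be looked up in F and t tables or obtained from a statistics library, which is left out here to keep the example dependency-free. Names are illustrative.

```python
import math


def prior_checks(ybar, ss, n, theta0, n0, sigma0_sq, nu0):
    """Return (F_d, (nu, nu0), t_d, nu_p) for the checks on the prior."""
    nu = n - 1
    s_sq = ss / nu                                   # sample variance s_d^2
    f_stat = s_sq / sigma0_sq                        # refer to F(nu, nu0)
    nu_p = nu0 + nu
    sp_sq = (nu0 * sigma0_sq + ss) / nu_p            # pooled estimate s_pd^2
    t_stat = (ybar - theta0) / math.sqrt((1.0 / n0 + 1.0 / n) * sp_sq)
    return f_stat, (nu, nu0), t_stat, nu_p
```

For DeGroot's figures, `prior_checks(4.20, 5.40, 10, 2.0, 0.1, 1/3, 6)` gives $F_d \approx 1.80$ on (9, 6) degrees of freedom and $t_d \approx 0.986$ on 15 degrees of freedom.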

Extreme cases of precise or vague prior knowledge.

The diagnostic checks developed for the present model are summarized in the
center box of Table 1. The rest of this table indicates the behavior of these
checks when the prior information about $\theta$ and/or $\sigma^2$ is of an
extreme nature (either very precise or very vague).

The special cases where the prior knowledge about $\sigma^2$ is so precise
that, in effect, it is assumed that $\sigma^2$ is known to equal $\sigma_0^2$,
have been considered by Box (1979b, pages 13-19, where $\sigma_\theta^2$ in his
notation corresponds to $\sigma_0^2/n_0$ in the present notation). Also, the
special cases where the prior knowledge about $\theta$ is so precise that, in
effect, it is assumed that $\theta$ is known, have been discussed by Box
(1979b, pages 24-26, where the assumed value of $\theta$ is denoted by
$\theta_0$ in the present notation).

In particular, consider the case where there is very precise information
about both $\theta$ and $\sigma^2$. This situation can be approximated using a
limiting argument where $n_0 \to \infty$ and $\nu_0 \to \infty$. On this
argument, the portmanteau predictive check given by (6) will correspond to a
test of the goodness of fit of the data to the $N(\theta_0, \sigma_0^2)$
distribution. Specifically, this test can be carried out by referring the
observed value of the statistic

$$Q = \sum_{u=1}^{n} \Bigl(\frac{y_u - \theta_0}{\sigma_0}\Bigr)^2 = t^2 + \nu F \tag{33}$$

to the $\chi^2$ distribution with n degrees of freedom, where the limiting
forms of t and F are as given in Table 1. For a discussion of the relationship
between precise prior knowledge and significance tests, see Box (1979b).
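The decomposition in (33) can be checked numerically. Since Table 1 is not reproduced in this section, the sketch assumes the limiting forms $t = \sqrt{n}(\bar{y} - \theta_0)/\sigma_0$ and $F = s^2/\sigma_0^2$, which are what the general definitions of t and F reduce to as $n_0 \to \infty$ and $\nu_0 \to \infty$. The function name is illustrative.

```python
import math


def q_decomposition(y, theta0, sigma0_sq):
    """Return (Q, t^2 + nu*F) under the assumed limiting forms of t and F."""
    n = len(y)
    nu = n - 1
    ybar = sum(y) / n
    s_sq = sum((v - ybar) ** 2 for v in y) / nu
    q = sum((v - theta0) ** 2 for v in y) / sigma0_sq
    t = (ybar - theta0) / math.sqrt(sigma0_sq / n)   # limiting form of t
    f = s_sq / sigma0_sq                             # F is unchanged in the limit
    return q, t * t + nu * f
```

The two returned values agree up to rounding, which reflects the usual split of the total sum of squares about $\theta_0$ into a component for the mean (one degree of freedom) and a component for the spread ($\nu$ degrees of freedom).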

Alternatively, consider the situation where there is relatively little prior
information about either $\theta$ or $\sigma^2$. This could be reflected by
values of $n_0$ and $\nu_0$ which are very small relative to n. However, in
lieu of an explicit specification of $n_0$ and $\nu_0$, the posterior
distribution of $\theta$ and $\sigma^2$ can be numerically approximated in a
suitable manner by using a limiting argument such as that which will be
developed in the next section. The consequence of this, though, would be that
the predictive checks involving t and F cannot be formally made. This should
not deter the investigator from rejecting the model M based on observed values
of $\bar{y}_d$ and/or $s_d^2$ that he considers to be extremely unlikely, since
such an action can be viewed as resulting from an informal check which, if
formalized through the explicit consideration of $n_0$ and $\nu_0$, would
result in unusually small probabilities
$\Pr[\,p(t \mid M) < p(t_d \mid M) \mid M\,]$ and/or
$\Pr[\,p(F \mid M) < p(F_d \mid M) \mid M\,]$. Further discussion of this point
is given by Box (1979b).

Finally, it is noted in Table 1 that, regardless of the nature of the prior
distribution $p(\theta, \sigma^2 \mid M)$, the model checks involving the
standardized residuals $\mathbf{r}$ (or, equivalently, the residual functions
$u_1, \ldots, u_{n-2}$) are always available for the investigator to use in
assessing the concordance between the observed data $\mathbf{y}_d$ and the
distributional form $p(\mathbf{y} \mid \theta, \sigma^2, M)$.

4. Discussion.

Further insight into the impact that vague prior knowledge has on the
predictive distribution in the example of the previous section can be gained by
considering the following argument.

Suppose that the investigator wishes to characterize the relatively little
prior information about $\theta$ and $\sigma^2$ by utilizing a prior in which
$\theta$ and $\phi_q(\sigma^2)$ are locally uniform and independent, where

$$\phi_q(\sigma^2) = \begin{cases} \sigma^q, & q \neq 0 \\ \ln \sigma, & q = 0 \end{cases} \tag{34}$$

so that

$$p(\theta, \sigma^2 \mid M) \propto (\sigma^2)^{\frac{q}{2} - 1}. \tag{35}$$


The relevant posterior distributions based on the above prior are given by
Box and Tiao (1973, Section 2.4.6). In the context of the previous section,
$p(\theta, \sigma^2 \mid \mathbf{y}, M)$ for the above situation can be
obtained as the limiting case of (12) - (14)*, where $n_0 \to 0$,
$\nu_0 \to -(q + 1)$ and $\nu_0\sigma_0^2 \to 0$, since
$p(\theta, \sigma^2 \mid M)$ as given by (8) - (10) corresponds in the limit to
the prior (35) above when terms not involving $\theta$ and $\sigma^2$ are
ignored.

*Note in passing that the limiting case where $n_0 \to 0$ and $\nu_0 \to 0$
will correspond to using a prior obtained from Jeffreys' Rule in the case where
$\theta$ and $\sigma^2$ are not assumed to be independent a priori. (See, for
example, Box and Tiao, Section 1.3.6.) However, somewhat paradoxically, the
resulting prior $p(\theta, \sigma^2 \mid M) \propto (\sigma^2)^{-3/2}$ would
have, from (34) and (35), the interpretation that $\theta$ and $\sigma^{-1}$
are locally uniform and independent a priori.

Using (22) and ignoring terms that do not involve either $\bar{y}$ or $s^2$,
it follows that

$$p(\bar{y}, s^2 \mid M) \propto (s^2)^{\frac{q}{2} - 1} \tag{36}$$

in the above limiting case, so that the predictive distribution of the maximum
likelihood estimates $\hat{\theta} = \bar{y}$ and
$\hat{\sigma}^2 = (n-1)s^2/n$ behaves like

$$p(\hat{\theta}, \hat{\sigma}^2 \mid M) \propto (\hat{\sigma}^2)^{\frac{q}{2} - 1}. \tag{37}$$

By comparing (37) with (35), the following intuitively reasonable observation
can be made: if the local uniformity and independence of $\theta$ and $\phi_q$
is assumed a priori, the behavior of the predictive distribution is such that
the maximum likelihood estimates $\hat{\theta}$ and
$\hat{\phi}_q = \phi_q(\hat{\sigma}^2)$ are similarly uniform and independent.
Stated another way, if the investigator believes a priori that a wide range of
values for $(\theta, \phi_q)$ are equally plausible, then the predictive
distribution for $(\hat{\theta}, \hat{\phi}_q)$ will appropriately reflect this
state of indifference.
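The step from (22) to (36) can be spelled out as follows. This is a sketch that takes the limits $n_0 \to 0$, $\nu_0 \to -(q+1)$ and $\nu_0\sigma_0^2 \to 0$ as stated above, writes $\bar{\sigma}^2$ for the posterior scale quantity of Section 3, and uses $\nu = n - 1$.

```latex
\begin{align*}
(\nu_0 + n)\bar{\sigma}^2
  &= \nu_0\sigma_0^2 + \nu s^2 + \frac{n_0 n(\bar{y} - \theta_0)^2}{n_0 + n}
   \;\longrightarrow\; \nu s^2, \\
\nu_0 + n &\;\longrightarrow\; n - q - 1 = \nu - q, \\
p(\bar{y}, s^2 \mid M)
  &\propto (s^2)^{\frac{\nu}{2} - 1}\,\bigl[(\nu_0 + n)\bar{\sigma}^2\bigr]^{-\frac{\nu_0 + n}{2}}
   \;\longrightarrow\; (s^2)^{\frac{\nu}{2} - 1}(\nu s^2)^{-\frac{\nu - q}{2}}
   \propto (s^2)^{\frac{q}{2} - 1}.
\end{align*}
```

The dependence on $\bar{y}$ drops out entirely in the limit, which is why the predictive distribution of $\hat{\theta} = \bar{y}$ is locally uniform.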

It is of interest to consider whether similar results hold in other modeling
situations. In particular, an example involving a discrete data-generating
distribution is now investigated.

5. The binomial distribution.

Consider a situation in which

(i) N Bernoulli trials are performed, in each of which the probability of
success is $\theta$,
(ii) uncertainty about $\theta$ is expressed as a beta prior with parameters
$b_1$ and $b_2$, and
(iii) the investigator can observe the number of successes, say Y, that occur
in the N trials.

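The model M corresponding to the above is thus

$$p(Y, \theta \mid M) = p(Y \mid \theta, M)\, p(\theta \mid M), \tag{38}$$

where

$$p(Y \mid \theta, M) = \binom{N}{Y} \theta^{Y} (1 - \theta)^{N - Y} \qquad (Y = 0, \ldots, N) \tag{39}$$

and

$$p(\theta \mid M) = \frac{\Gamma(b_1 + b_2)}{\Gamma(b_1)\Gamma(b_2)}\, \theta^{b_1 - 1} (1 - \theta)^{b_2 - 1}. \tag{40}$$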
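
Subsequently,

$$p(\theta \mid Y, M) = \frac{\Gamma(N + b_1 + b_2)}{\Gamma(Y + b_1)\Gamma(N - Y + b_2)}\, \theta^{Y + b_1 - 1} (1 - \theta)^{N - Y + b_2 - 1} \tag{41}$$

and

$$p(Y \mid M) = \binom{N}{Y} \frac{\Gamma(b_1 + b_2)}{\Gamma(b_1)\Gamma(b_2)}\, \frac{\Gamma(Y + b_1)\Gamma(N - Y + b_2)}{\Gamma(N + b_1 + b_2)}, \tag{42}$$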


so that once the actual observation $Y_d$ becomes available to the
investigator, the posterior distribution $p(\theta \mid Y_d, M)$ can be
obtained from (41) and a check on its relevance can be made by referring $Y_d$
to the predictive distribution (42). This predictive distribution, which is
sometimes called the beta-binomial (see, for example, Kendall and Stuart, 1969,
page 146), is such that

$$\begin{aligned}
E(Y \mid M) &= \frac{N b_1}{b_1 + b_2} = N\theta_0, \\
\mathrm{Var}(Y \mid M) &= \frac{N b_1 b_2 (b_1 + b_2 + N)}{(b_1 + b_2)^2 (b_1 + b_2 + 1)} = N\theta_0(1 - \theta_0)\,\frac{N_0 + N}{N_0 + 1},
\end{aligned} \tag{43}$$

where $\theta_0 = b_1/(b_1 + b_2)$ and $N_0 = b_1 + b_2$.

For the situation where there is relatively precise prior knowledge about
$\theta$ in that $N_0$ is very large in comparison to N, the predictive check
involving Y will approximately correspond to the Neyman-Pearson significance
test of the null hypothesis $\theta = \theta_0$. (This correspondence becomes
exact as $N_0 \to \infty$.)
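The beta-binomial predictive distribution and the associated check can be sketched as follows. Because Y is discrete, the sketch sums the probabilities of all outcomes that are no more probable than the observed one (a non-strict inequality, which is an assumption made here so that the observed outcome itself is counted). Function names are illustrative.

```python
import math


def beta_binomial_pmf(y, n_trials, b1, b2):
    """Predictive probability p(Y = y | M) for a beta(b1, b2) prior."""
    log_p = (
        math.lgamma(n_trials + 1) - math.lgamma(y + 1) - math.lgamma(n_trials - y + 1)
        + math.lgamma(b1 + b2) - math.lgamma(b1) - math.lgamma(b2)
        + math.lgamma(y + b1) + math.lgamma(n_trials - y + b2)
        - math.lgamma(n_trials + b1 + b2)
    )
    return math.exp(log_p)


def predictive_check(y_obs, n_trials, b1, b2):
    """Pr[p(Y | M) <= p(y_obs | M) | M], summed over Y = 0, ..., N."""
    pmf = [beta_binomial_pmf(y, n_trials, b1, b2) for y in range(n_trials + 1)]
    cutoff = pmf[y_obs] * (1.0 + 1e-9)  # small tolerance for rounding ties
    return sum(p for p in pmf if p <= cutoff)
```

A small value of `predictive_check(y_obs, N, b1, b2)` would indicate that the observed count is discordant with the model, in which case the posterior distribution for $\theta$ is of questionable relevance.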

Of more practical interest is the situation where there is relatively little
prior knowledge about $\theta$ in that $N_0$ is small in comparison to N.
Arguing as in the example of the previous section, the investigator may wish to
characterize this lack of prior information by employing a prior which is
locally uniform in some appropriate metric $\phi = \phi(\theta)$. Three
particular choices for this metric are now considered in detail.

Case 1: $\phi = \theta$.

Since the admissible range $0 \le \theta \le 1$ is finite, a prior for
$\theta$ which is globally uniform (rather than just locally uniform) can be
used. This prior would correspond to the choices $b_1 = b_2 = 1$ in (40), so
that $\theta_0 = .5$ and $N_0 = 2$ in (43).

Figure 1 shows, for N = 10, the joint distribution
$p(\hat{\theta}, \theta \mid M)$ for this situation, where
$\hat{\theta} = Y/N$ is the maximum likelihood estimate of $\theta$. From (42),

$$p(Y \mid M) = \frac{1}{N + 1} \qquad (Y = 0, \ldots, N) \tag{44}$$

so that a uniform prior on $\theta$ over the interval $0 \le \theta \le 1$
results in a discrete uniform predictive for $\hat{\theta}$, assigning
probability $1/(N+1)$ to each of $\hat{\theta} = 0, \frac{1}{N}, \ldots, 1$.
Note how this relationship between the prior for $\theta$ and the predictive
for $\hat{\theta}$ is analogous to the relationship between the prior for the
normal model parameters and the predictive for their maximum likelihood
estimates that was discussed in the previous section.

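To further develop the prior-predictive correspondence in the present
example, consider the predictive cumulative distribution function of
$\hat{\theta}$,

$$F_{\hat{\theta}}(t) = \begin{cases} 0, & t < 0 \\ \dfrac{[Nt] + 1}{N + 1}, & 0 \le t \le 1 \\ 1, & t > 1 \end{cases} \tag{45}$$

where $[Nt]$ denotes the integer part of $Nt$. Since the prior cumulative
distribution function of $\theta$ is

$$F_{\theta}(t) = \begin{cases} 0, & t < 0 \\ t, & 0 \le t \le 1 \\ 1, & t > 1 \end{cases} \tag{46}$$

it is easy to see that $F_{\hat{\theta}}(t)$ converges to $F_{\theta}(t)$,
that is, $\hat{\theta}$ converges in distribution, since
$([Nt] + 1)/(N + 1) \to t$ as $N \to \infty$ for all $0 \le t \le 1$.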
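The convergence of the predictive distribution function to the uniform prior distribution function can be illustrated numerically. The sketch below evaluates the largest gap between the two on a grid; the grid size and function names are illustrative choices, not part of the original argument.

```python
import math


def predictive_cdf(t, n_trials):
    """Predictive CDF of Y/N under a uniform prior on theta."""
    if t < 0:
        return 0.0
    if t > 1:
        return 1.0
    return (math.floor(n_trials * t) + 1) / (n_trials + 1)


def max_gap(n_trials, grid=1000):
    """Largest |predictive CDF - t| over t = 0, 1/grid, ..., 1."""
    return max(
        abs(predictive_cdf(k / grid, n_trials) - k / grid) for k in range(grid + 1)
    )
```

The gap never exceeds $1/(N+1)$, so it shrinks to zero as N grows; for example, `max_gap(10)` equals 1/11, attained at t = 0.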


Case 2: $\phi = \sin^{-1}\sqrt{\theta}$.

This metric is recognized as the familiar asymptotic variance stabilizing
transformation for the binomial distribution. It is also the metric in which an
approximately noninformative prior distribution, as determined by Jeffreys'
Rule, will be locally uniform. (See, for example, Box and Tiao, 1973, Sections
1.3.4 and 1.3.5.) However, since the admissible range
$0 \le \sin^{-1}\sqrt{\theta} \le \pi/2$ is finite for this metric, a globally
uniform prior can be considered. In terms of the original metric $\theta$, this
prior will correspond to the choices $b_1 = b_2 = \frac{1}{2}$ in (40), so that
$\theta_0 = .5$ and $N_0 = 1$ in (43).

From  (42), 

p(Y|M)  »  1  LLL±  LLLzJLi.  *?■)  ( Y  *  0, , . .  ,N) .  (47) 

ir  T(Y  +  1)  T(N  -  Y  +  1) 

Note that this is a symmetric, "u-shaped" discrete distribution and, as such, the value(s) of $Y$ having smallest probability will be $Y = N/2$ for $N$ even and $Y = (N \pm 1)/2$ for $N$ odd. Hence the possible values of the maximum likelihood estimate $\hat\phi = \sin^{-1}\sqrt{\hat\theta} = \sin^{-1}\sqrt{Y/N}$ will not all be equiprobable in the predictive sense. The failure of the uniform prior for $\phi = \sin^{-1}\sqrt{\theta}$ to produce a uniform predictive distribution for $\hat\phi$ in this case is seemingly inconsistent with the findings in the examples previously discussed (the normal example in Section 4 and Case 1 above for the binomial example).
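The u-shape of (47) can be inspected directly. The sketch below evaluates (47) for $N = 10$ and locates the least probable value of $Y$; the helper name `arcsine_predictive` is an illustrative choice, not the paper's notation.

```python
from math import lgamma, exp, pi

def arcsine_predictive(y, n):
    """Predictive pmf (47) of Y, i.e. the case b1 = b2 = 1/2."""
    log_ratio = (lgamma(y + 0.5) + lgamma(n - y + 0.5)
                 - lgamma(y + 1) - lgamma(n - y + 1))
    return exp(log_ratio) / pi

N = 10
pmf = [arcsine_predictive(y, N) for y in range(N + 1)]

# Index of the smallest predictive probability (N even, so a single minimum).
least_likely = min(range(N + 1), key=lambda y: pmf[y])
```

For even $N$ the minimum sits at $Y = N/2$ and the two end values $Y = 0$ and $Y = N$ carry the largest probabilities, as the text states.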

Further comparison of the present case ($\phi = \sin^{-1}\sqrt{\theta}$) with the previous case ($\phi = \theta$) reveals a possible source of the above apparent inconsistency. For the $\phi = \theta$ case, with $N$ fixed, the $N+1$ possible realizations of the estimate $\hat\theta$ (i.e., $\hat\theta = i/N$ for $i = 0, \ldots, N$) are evenly spread over the interval $0 \le \hat\theta \le 1$ corresponding to the admissible range for $\theta$. This, along with the uniform prior for $\theta$, results in predictive probabilities for $\hat\theta$ which are uniformly distributed over these $N+1$ possible values. However, for the $\phi = \sin^{-1}\sqrt{\theta}$ case, the $N+1$ possible realizations of the estimate $\hat\phi$ (i.e., $\hat\phi = \sin^{-1}\sqrt{i/N}$ for $i = 0, \ldots, N$) are unequally spaced over the interval $0 \le \hat\phi \le \pi/2$ corresponding to the admissible range for $\phi$ (although they are spaced symmetrically about the midpoint of this range, $\pi/4$). Note that this unequal spacing is such that the possible values for $\hat\phi$ become more spread out as one moves away from the midpoint and towards either end of the admissible interval. It can thus be heuristically argued that, due to the continuous uniform prior for $\phi$, the discrete predictive distribution of $\hat\phi$ compensates for the nonuniformity, per se, of the spacing of possible $\hat\phi$ values by assigning larger probabilities to those values which are further away from the midpoint $\pi/4$, in accordance with the increasingly spread out nature of these values. The result of this compensation is that the predictive probabilities of different intervals for $\hat\phi$ having the same length are more nearly equal than they would be if, say, the predictive probabilities of the $N+1$ possible values for $\hat\phi$ were all equal.
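The compensation argument can be illustrated numerically. The sketch below splits the admissible range $[0, \pi/2]$ of $\hat\phi$ into four intervals of equal length and compares the predictive mass each receives under (47) with the mass it would receive if all $N+1$ values were equally probable. The choices $N = 25$ and four intervals are arbitrary illustrative assumptions (picked so that no value of $\hat\phi$ falls exactly on an interior boundary), and the function names are ours.

```python
from math import asin, sqrt, lgamma, exp, pi

def arcsine_predictive(y, n):
    """Predictive pmf (47) of Y, i.e. the case b1 = b2 = 1/2."""
    log_ratio = (lgamma(y + 0.5) + lgamma(n - y + 0.5)
                 - lgamma(y + 1) - lgamma(n - y + 1))
    return exp(log_ratio) / pi

def interval_masses(n, k, pmf):
    """Total predictive mass of phi_hat = asin(sqrt(y/n)) in k equal-length cells of [0, pi/2]."""
    masses = [0.0] * k
    for y in range(n + 1):
        phi_hat = asin(sqrt(y / n))
        # Half-open cells; the right end point pi/2 is placed in the last cell.
        cell = min(int(phi_hat / (pi / 2) * k), k - 1)
        masses[cell] += pmf(y)
    return masses

N, K = 25, 4
under_47 = interval_masses(N, K, lambda y: arcsine_predictive(y, N))
under_equal = interval_masses(N, K, lambda y: 1.0 / (N + 1))

worst_47 = max(abs(m - 1.0 / K) for m in under_47)
worst_equal = max(abs(m - 1.0 / K) for m in under_equal)
```

Here `worst_47` should come out much smaller than `worst_equal`: under (47) the four equal-length intervals receive nearly equal mass, whereas equal point probabilities would over-weight the two central intervals, where the possible values of $\hat\phi$ are more densely packed.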


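A formal justification of the above heuristic argument can be developed by comparing the prior cumulative distribution function of $\phi$

$$F_{\phi}(t) = \begin{cases} 0, & t < 0 \\ \dfrac{t}{\pi/2}, & 0 \le t \le \dfrac{\pi}{2} \\ 1, & t > \dfrac{\pi}{2} \end{cases} \tag{48}$$

with the predictive cumulative distribution function of $\hat\phi$

$$F_{\hat\phi}(t) = \begin{cases} 0, & t < 0 \\ \dfrac{1}{\pi} \displaystyle\sum_{i=0}^{[N \sin^2 t]} \dfrac{\Gamma(i + \tfrac12)\,\Gamma(N - i + \tfrac12)}{\Gamma(i + 1)\,\Gamma(N - i + 1)}, & 0 \le t \le \dfrac{\pi}{2} \\ 1, & t > \dfrac{\pi}{2}. \end{cases} \tag{49}$$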

It turns out that $F_{\hat\phi}(t)$ converges in distribution to $F_{\phi}(t)$, a result which immediately follows from the fact that the predictive distribution of $\hat\theta$ converges to the prior distribution of $\theta$ for any choice of $b_1 > 0$ and $b_2 > 0$ in (40) (and, in particular, for the choice $b_1 = b_2 = \tfrac12$ pertaining to the present case). The proof of this fact is given in the Appendix; verification of this fact for the particular choice $b_1 = b_2 = 1$ was given in the discussion of Case 1 above.

Visual insight as to the nature of the above result can be gained by considering Figures 2a, b and c, which show for $N = 20$, 50 and 100, respectively, the predictive cumulative distribution function of $\hat\phi = \sin^{-1}\sqrt{Y/N}$ (the solid line in each figure) as compared with the uniform prior cumulative distribution function of $\phi = \sin^{-1}\sqrt{\theta}$ (the broken line in each figure). Notice that, even for $N = 20$, these two distribution functions are in close agreement except at the extreme ends of the interval $0 \le \hat\phi \le \pi/2$, where the disagreement will necessarily be accentuated due to the discrete nature of $\hat\phi$ and the way in which the possible realizations of $\hat\phi$ are spread along this interval.
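In the absence of the figures, the agreement between (48) and (49) can be summarized by the largest vertical gap between the two distribution functions. The sketch below computes that gap for the three sample sizes mentioned in the text; using the sup-distance as the summary, and the name `max_cdf_gap`, are our assumptions rather than anything the paper prescribes.

```python
from math import asin, sqrt, lgamma, exp, pi

def arcsine_predictive(y, n):
    """Predictive pmf (47) of Y, i.e. the case b1 = b2 = 1/2."""
    log_ratio = (lgamma(y + 0.5) + lgamma(n - y + 0.5)
                 - lgamma(y + 1) - lgamma(n - y + 1))
    return exp(log_ratio) / pi

def max_cdf_gap(n):
    """Largest gap between the step c.d.f. (49) and the uniform c.d.f. (48).

    The gap can only peak at a jump point, so it is enough to compare the
    uniform c.d.f. with the step function just below and at each jump.
    """
    gap, below = 0.0, 0.0
    for y in range(n + 1):
        prior_cdf = asin(sqrt(y / n)) / (pi / 2)
        at = below + arcsine_predictive(y, n)
        gap = max(gap, abs(below - prior_cdf), abs(at - prior_cdf))
        below = at
    return gap

gaps = {n: max_cdf_gap(n) for n in (20, 50, 100)}
```

The gaps shrink as $N$ grows, consistent with the convergence result, and inspecting the individual terms shows the worst disagreement arising at the ends of the interval, in line with the remark above.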

Thus, to summarize this second case, it has been shown that the predictive distribution (47), which at first glance seems to betray the vague prior information about $\phi = \sin^{-1}\sqrt{\theta}$ that is used in generating it, can upon closer look be interpreted in a sensible manner when expressed in terms of $\hat\phi = \sin^{-1}\sqrt{Y/N}$, since it is the unequal spacing of the possible $\hat\phi$ values that produces this u-shaped distribution.


Case 3: $\phi = \ln\left(\dfrac{\theta}{1-\theta}\right)$.

This metric corresponds to the logarithm of the odds in favor of observing a success as the outcome of a single Bernoulli trial. Consideration of the "log odds" as an appropriate metric in which to express prior ignorance has been advocated by several authors. (See, for example, Lindley, 1965, Section 7.2.)

[Fig. 2b. Predictive c.d.f. of $\sin^{-1}\sqrt{\hat\theta}$, $N = 50$.]

In terms of the original metric $\theta$, a locally uniform prior for $\phi = \ln\left(\dfrac{\theta}{1-\theta}\right)$ will be such that

$$p(\theta \mid M) \propto \frac{1}{\theta(1 - \theta)}. \tag{50}$$

Note that, unlike the previous two cases discussed, the admissible range for the present choice of $\phi$ is infinite. (Specifically, this range is the extended real line, $-\infty < \phi < \infty$.) Hence it does not make sense to talk about a globally uniform prior for $\phi$. Also notice, however, that if terms not involving $\theta$ are ignored, then (50) can be obtained as the limiting case of (40) with $b_1 \to 0$ and $b_2 \to 0$ (or, equivalently, with $N_0 \to 0$ for $\theta_0 = \tfrac12$ fixed, using (43)).

It is precisely this limiting argument that Dawid (1979) uses as an illustration of a situation where the appropriateness of model checking via the predictive distribution is, in his view, questionable. Specifically, he notes that the limiting form of the predictive distribution for this example is

$$\Pr(Y = 0 \mid M) = \Pr(Y = N \mid M) = \tfrac12, \tag{51}$$

so that, in Dawid's words, "any value $0 < Y < N$ discredits this 'model-cum-prior'."

It should be noted, however, that although $Y = 0$ and $Y = N$ have a combined predictive probability of unity in the limit, the corresponding limiting posterior distribution of $\phi$ (or, for that matter, of $\theta$) does not exist (i.e., is improper) for both of these values of $Y$!*
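The limiting behaviour in (51) can be watched numerically by letting $b_1 = b_2 = \varepsilon$ shrink towards zero. The sketch below again assumes that (42) is the beta-binomial form appearing in (A2); the values of $\varepsilon$ and $N = 10$ are arbitrary illustrative choices.

```python
from math import lgamma, exp

def beta_binomial_pmf(y, n, b1, b2):
    """Predictive probability of y successes in n trials under a Beta(b1, b2) prior
    (assumed form of (42), taken from the summand of (A2))."""
    log_choose = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    log_numer = lgamma(y + b1) + lgamma(n - y + b2) - lgamma(n + b1 + b2)
    log_denom = lgamma(b1) + lgamma(b2) - lgamma(b1 + b2)
    return exp(log_choose + log_numer - log_denom)

N = 10
end_mass = {}
for eps in (1e-1, 1e-3, 1e-6):
    # Combined predictive probability of the two extreme outcomes Y = 0 and Y = N.
    end_mass[eps] = (beta_binomial_pmf(0, N, eps, eps)
                     + beta_binomial_pmf(N, N, eps, eps))
```

As $\varepsilon$ decreases, `end_mass` climbs towards one, with half of it on each of $Y = 0$ and $Y = N$, which is the limiting form (51).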

To better understand what is happening for this case, it is worthwhile to take a closer look at this limiting process. For any fixed $N_0 > 0$, with $\theta_0 = \tfrac12$ also fixed, the prior distribution for $\theta$ is equivalent to a globally uniform prior in the metric

$$\int_0^{\theta} [t(1 - t)]^{\frac{N_0}{2} - 1}\, dt. \tag{52}$$

(In particular, the choices $N_0 = 2$ and $N_0 = 1$ cause (52) to correspond to $\theta$ and $\sin^{-1}\sqrt{\theta}$, respectively; these were the previous two cases discussed.) The result in the Appendix can then be applied to conclude that the predictive cumulative distribution function of the maximum likelihood estimate

$$\int_0^{Y/N} [t(1 - t)]^{\frac{N_0}{2} - 1}\, dt \tag{53}$$

of the metric (52) converges to the uniform prior cumulative distribution function for this metric. Thus the arguments supporting the reasonableness of the predictive distribution for Cases 1 and 2 will also apply to the present general situation where $N_0 > 0$.

*It is interesting to note that Lindley (1965) takes the view that "reliable inferences cannot be made about the ratio of successes to failures until an example of each has occurred" whereas Bernardo (1979) finds using the Case 3 prior to be "less than adequate" in comparison to using the prior discussed in Case 2.
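The correspondence between (52) and the metrics of Cases 1 and 2 can be verified by direct integration. The following worked step is added for clarity; the substitution $t = \sin^2 s$ is our choice of method.

```latex
N_0 = 2:\quad \int_0^{\theta} [t(1-t)]^{0}\,dt = \theta ,
\\[4pt]
N_0 = 1:\quad \int_0^{\theta} [t(1-t)]^{-1/2}\,dt
  = \int_0^{\sin^{-1}\sqrt{\theta}} \frac{2\sin s\,\cos s}{\sin s\,\cos s}\,ds
  = 2\,\sin^{-1}\sqrt{\theta}
  \qquad (t = \sin^2 s,\ dt = 2\sin s\,\cos s\,ds).
```

The constant factor 2 is immaterial, since a prior that is uniform in $2\sin^{-1}\sqrt{\theta}$ is also uniform in $\sin^{-1}\sqrt{\theta}$.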

However, setting $N_0 = 0$ (rather than $N_0$ small and positive) causes the metric (52) to correspond to $\ln\left(\dfrac{\theta}{1-\theta}\right)$ and thus have an infinite admissible range (rather than the finite range obtained with any $N_0 > 0$). Furthermore, the Appendix result cannot be used as a formal argument supporting the predictive distributional form. Nevertheless, the heuristic argument used in Case 2 to justify the nature of the predictive distribution there applies here, too. For $\phi = \ln\left(\dfrac{\theta}{1-\theta}\right)$, the outcomes $Y = 0$ and $Y = N$ yield maximum likelihood estimates $\hat\phi = -\infty$ and $\hat\phi = \infty$, respectively, which are so far removed from the other possible realizations of $\hat\phi$ that, in order to appropriately reflect the improper prior for $\phi$ over the whole extended real line, the predictive distribution for $\hat\phi$ assigns probability $\tfrac12$ to each of $\hat\phi = \pm\infty$. (In other words, this is an extreme case of a "u-shaped" distribution caused by the discrete nature of $\hat\phi$ and the unequal spacing of its realizations.)

6. Further remarks on the binomial example.

The binomial example just discussed illustrates how care should be exercised in the interpretation of predictive distributions which arise from discrete data-generating distributions in situations where the prior information about the parameters is vague.

In this example it was explained how, given a reasonable representation of vague prior knowledge with respect to some metric of $\theta$, the predictive distribution of the maximum likelihood estimate of the metric of interest will appropriately reflect this ignorance. The three most popular candidates for representing the vague prior information were each discussed individually. (For the problem of deciding which of these three choices should be preferred over the others, the reader is referred to Section 3.4 of Bernardo, 1979, and the references in that paper.)

It should be noted that, although these three choices give rise to quite different predictive distributions for $Y$, the corresponding posterior distributions for $\theta$ will not differ greatly in most cases, when $N$ is not too small. Specifically, Good (1965) comments that "when there are more than three successes and three failures, there is little difference between the three methods ... for many practical purposes."

Finally, note that from the beginning it has been assumed that only the number of successes $Y$ that occur in the $N$ trials is observed by the investigator. Suppose, instead, that each of the $N$ actual Bernoulli trial results is observed individually, so that a vector $\mathbf{y} = (y_1, \ldots, y_N)'$ of zeros and ones corresponding to failures and successes, respectively, represents the experimental results. Then, in terms of the previous model $p(Y, \theta \mid M)$ given by (38), the relevant model now becomes

$$p(\mathbf{y}, \theta \mid M) = p(\mathbf{y} \mid \theta, M)\, p(\theta \mid M) = p(\mathbf{y} \mid Y, M)\, p(Y, \theta \mid M), \tag{54}$$

where the additional factor

$$p(\mathbf{y} \mid Y, M) = \begin{cases} \dbinom{N}{Y}^{-1} & \text{for } \mathbf{y}'\mathbf{y} = Y \\ 0 & \text{otherwise} \end{cases} \tag{55}$$

can be viewed as the conditional predictive distribution of $\mathbf{y}$ given $Y$ and can thus be used in obtaining diagnostic checks for departure from the assumed form of the data-generating distribution caused by, say, serial dependence.
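One way such a diagnostic check might be carried out is sketched below. It refers a checking function of $\mathbf{y}$ to its exact reference distribution under (55), where every arrangement with $Y$ successes has probability $\binom{N}{Y}^{-1}$. The use of the number of runs as the checking function is our illustrative assumption (the paper does not specify one here), as are the function names and the example data.

```python
from itertools import combinations
from collections import Counter
from fractions import Fraction

def count_runs(y):
    """Number of maximal blocks of identical consecutive outcomes in y."""
    return 1 + sum(1 for a, b in zip(y, y[1:]) if a != b)

def runs_reference_distribution(n, successes):
    """Exact distribution of the runs count under (55): all arrangements of
    `successes` ones among n trials are equally likely."""
    counts = Counter()
    for positions in combinations(range(n), successes):
        y = [0] * n
        for k in positions:
            y[k] = 1
        counts[count_runs(y)] += 1
    total = sum(counts.values())
    return {r: Fraction(c, total) for r, c in sorted(counts.items())}

# Hypothetical data: successes clustered at the start, suggesting positive dependence.
observed = [1, 1, 1, 1, 0, 0, 0, 0]
reference = runs_reference_distribution(len(observed), sum(observed))
observed_runs = count_runs(observed)

# Probability, under (55), of seeing as few runs as observed or fewer.
lower_tail = sum(p for r, p in reference.items() if r <= observed_runs)
```

Full enumeration is only practical for small $N$; for this example there are 70 arrangements and only two of them have as few as two runs, so a small `lower_tail` would cast doubt on the independence assumption.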
7. Summary.


This paper has dealt with an approach to model building whereby a sampling theory argument is used in criticizing a tentative model by referring diagnostic functions of the observed data to their appropriate reference predictive distributions, while a Bayesian argument is used in estimating the model parameters via the posterior distribution. Examples involving the normal and the binomial distributions illustrated the main points of this approach. Particular attention was given to situations in which prior knowledge about the parameters was vague and, for the binomial case, particular difficulties in predictive distribution interpretation were discussed.


Appendix.

Proof that the predictive distribution of $\hat\theta$ converges to the prior distribution of $\theta$ for the binomial model.


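The prior cumulative distribution function of $\theta$ and the predictive cumulative distribution function of $\hat\theta$ are, respectively,

$$F_{\theta}(t) = \begin{cases} 0, & t < 0 \\ \displaystyle\int_0^t \frac{\Gamma(b_1 + b_2)}{\Gamma(b_1)\Gamma(b_2)}\, u^{b_1 - 1} (1 - u)^{b_2 - 1}\, du, & 0 \le t \le 1 \\ 1, & t > 1 \end{cases} \tag{A1}$$

$$F_{\hat\theta}(t) = \begin{cases} 0, & t < 0 \\ \displaystyle\sum_{i=0}^{[Nt]} \binom{N}{i} \frac{\Gamma(b_1 + b_2)}{\Gamma(b_1)\Gamma(b_2)}\, \frac{\Gamma(i + b_1)\,\Gamma(N - i + b_2)}{\Gamma(N + b_1 + b_2)}, & 0 \le t \le 1 \\ 1, & t > 1. \end{cases} \tag{A2}$$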


These agree exactly on the intervals $t < 0$ and $t > 1$. For $t = 0$, $F_{\theta}(0) = 0$, while

$$F_{\hat\theta}(0) = \frac{\Gamma(b_1 + b_2)}{\Gamma(b_2)}\, \frac{\Gamma(N + b_2)}{\Gamma(N + b_1 + b_2)} \to 0 \quad \text{as } N \to \infty,$$

since $\dfrac{\Gamma(N + b_2)}{\Gamma(N + b_1 + b_2)}$ behaves like $N^{-b_1}$ for $N$ large, as can be verified from Stirling's series. (See, for example, Box and Tiao, 1973, Appendix A2.2.) It thus suffices to show that

$$\sum_{i=0}^{[Nt]} \binom{N}{i} \frac{\Gamma(i + b_1)\,\Gamma(N - i + b_2)}{\Gamma(N + b_1 + b_2)} \;\to\; \int_0^t u^{b_1 - 1} (1 - u)^{b_2 - 1}\, du \tag{A3}$$

as $N \to \infty$ for $0 < t < 1$.
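The Stirling-series claim used above, that $\Gamma(N + b_2)/\Gamma(N + b_1 + b_2)$ behaves like $N^{-b_1}$, can be spot-checked numerically. The sketch below multiplies the ratio by $N^{b_1}$ and watches it approach one; the values $b_1 = b_2 = \tfrac12$ and the function name are illustrative assumptions.

```python
from math import lgamma, exp

def scaled_gamma_ratio(n, b1, b2):
    """Gamma(n + b2) / Gamma(n + b1 + b2), rescaled by n**b1.

    If the ratio behaves like n**(-b1), this quantity tends to 1 as n grows.
    """
    return exp(lgamma(n + b2) - lgamma(n + b1 + b2)) * n ** b1

b1, b2 = 0.5, 0.5
ratios = {n: scaled_gamma_ratio(n, b1, b2) for n in (10, 100, 1000, 10000)}
```

The entries of `ratios` move steadily towards one as $N$ increases, so $F_{\hat\theta}(0)$ indeed decays like a constant times $N^{-b_1}$.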

Now, the summand on the left-hand side of (A3) can be written

$$\frac{\Gamma(N + 1)}{\Gamma(N + b_1 + b_2)}\, \frac{\Gamma(i + b_1)}{\Gamma(i + 1)}\, \frac{\Gamma(N - i + b_2)}{\Gamma(N - i + 1)} \tag{A4}$$

which behaves like

$$N^{1 - b_1 - b_2}\, i^{\,b_1 - 1}\, (N - i)^{b_2 - 1} \tag{A5}$$

for $N$, $i$ and $N - i$ all large; so that summation from $i = [\ln N] + 1$ to $i = [Nt]$ for $N$ large gives

$$\frac{1}{N} \sum_{i = [\ln N] + 1}^{[Nt]} \left(\frac{i}{N}\right)^{b_1 - 1} \left(1 - \frac{i}{N}\right)^{b_2 - 1} \tag{A6}$$

which is recognized as a Riemann sum representation of the integral

$$\int_{[\ln N]/N}^{[Nt]/N} u^{b_1 - 1} (1 - u)^{b_2 - 1}\, du. \tag{A7}$$


Furthermore, when $N$ is large and $i$ is small in comparison, (A4) behaves like

$$N^{-b_1}\, \frac{\Gamma(i + b_1)}{\Gamma(i + 1)} \tag{A8}$$

so that summation from $i = 0$ to $i = [\ln N]$ for $N$ large gives

$$N^{-b_1} \sum_{i=0}^{[\ln N]} \frac{\Gamma(i + b_1)}{\Gamma(i + 1)} = N^{-b_1}\, \frac{\Gamma([\ln N] + b_1 + 1)}{\Gamma([\ln N] + 1)\, b_1}, \tag{A9}$$

which behaves like

$$\frac{1}{b_1} \left(\frac{[\ln N]}{N}\right)^{b_1} \tag{A10}$$

and thus approaches zero as $N \to \infty$. Hence, letting $N \to \infty$ in (A7) gives the desired result (A3).
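Two ingredients of the proof lend themselves to a numerical spot check: the closed form of the finite sum in (A9), and the convergence (A3) itself. The sketch below is a sanity check under stated assumptions, not a substitute for the proof. It takes $b_1 = b_2 = \tfrac12$, for which the integral in (A3) has the closed form $2\sin^{-1}\sqrt{t}$, and $t = 0.25$; all function names are ours.

```python
from math import lgamma, exp, asin, sqrt

def partial_sum(m, b):
    """Left-hand sum of (A9) without the N**(-b) factor: sum of Gamma(i+b)/Gamma(i+1), i = 0..m."""
    return sum(exp(lgamma(i + b) - lgamma(i + 1)) for i in range(m + 1))

def closed_form(m, b):
    """Right-hand side of (A9) without the N**(-b) factor."""
    return exp(lgamma(m + b + 1) - lgamma(m + 1)) / b

def lhs_A3(n, t, b1, b2):
    """Left-hand side of (A3), summed term by term on the log scale."""
    total = 0.0
    for i in range(int(n * t) + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + lgamma(i + b1) + lgamma(n - i + b2)
                    - lgamma(n + b1 + b2))
        total += exp(log_term)
    return total

t = 0.25
target = 2 * asin(sqrt(t))  # integral in (A3) when b1 = b2 = 1/2
errors = {n: abs(lhs_A3(n, t, 0.5, 0.5) - target) for n in (100, 1000, 10000)}
identity_gap = abs(partial_sum(7, 0.5) - closed_form(7, 0.5))
```

`identity_gap` should be at rounding-error level, and `errors` should shrink as $N$ grows, in agreement with (A3).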